Read and Writing Data

In R, there are many functions/packages to read all kinds of datasets. Some examples are:

There are also many ways for wrinting data files

CSV stands for Comma Separated Values.

Reading Data files with read.table

For small to moderately sized datasets, you can usually call read.table without specifying any other arguments

Let’s use read.table() to read the text file Employee Sales

First, let’s locate our current working Directory

getwd()

Now, let’s display the list of files present in that directory

list.files()
## [1] "03_meeting_introduction_to_R_p2.html"
## [2] "03_meeting_introduction_to_R_p2.Rmd" 
## [3] "Employees.csv"                       
## [4] "EmployeeSales.txt"
employee_sales <- read.table("EmployeeSales.txt", header=TRUE, sep=",")
is.data.frame(employee_sales)
## [1] TRUE
employee_sales
##    Employee_ID FirstName LastName           Education     Occupation
## 1            1      John     Yang           Bachelors   Professional
## 2            2       Rob  Johnson           Bachelors     Management
## 3            3     Ruben   Torres     Partial College Skilled Manual
## 4            4   Christy      Zhu           Bachelors   Professional
## 5            5       Rob    Huang         High School Skilled Manual
## 6            6      John     Ruiz           Bachelors   Professional
## 7            7      John   Miller      Masters Degree     Management
## 8            8   Christy    Mehta Partial High School       Clerical
## 9            9       Rob  Verhoff Partial High School       Clerical
## 10          10   Christy  Carlson     Graduate Degree     Management
## 11          11      Gail Erickson           Education   Professional
## 12          12     Barry  Johnson           Education     Management
## 13          13     Peter    Krebs     Graduate Degree       Clerical
## 14          14      Greg Alderson Partial High School       Clerical
##    YearlyIncome     Sales
## 1         90000 3578.2700
## 2         80000 3399.9900
## 3         50000  699.0982
## 4         80000 3078.2700
## 5         60000 2319.9900
## 6         70000  539.9900
## 7         80000 2320.4900
## 8         50000   24.9900
## 9         45000   24.9900
## 10        70000 2234.9900
## 11        90000 4319.9900
## 12        80000 4968.5900
## 13        50000   59.5300
## 14        45000   23.5000
ncol(employee_sales)
## [1] 7
nrow(employee_sales)
## [1] 14
dim(employee_sales)
## [1] 14  7

In this case, R will automatically

  • figure out how many rows there are (and how much memory needs to be allocated)
  • figure what type of variable is in each column of the table.

Telling R all these things directly makes R run faster and more efficiently. The read.csv() function is identical to read.table except that some of the defaults are set differently (like the sep argument).

Reading the Employees.csv file using read.csv().

employee <- read.csv("Employees.csv", TRUE, sep = ",")
employee
##    FirstName LastName           Education     Occupation YearlyIncome
## 1       John     Yang           Bachelors   Professional        90000
## 2        Rob  Johnson           Bachelors     Management        80000
## 3      Ruben   Torres     Partial College Skilled Manual        50000
## 4    Christy      Zhu           Bachelors   Professional        80000
## 5        Rob    Huang         High School Skilled Manual        60000
## 6       John     Ruiz           Bachelors   Professional        70000
## 7       John   Miller      Masters Degree     Management        80000
## 8    Christy    Mehta Partial High School       Clerical        50000
## 9        Rob  Verhoff Partial High School       Clerical        45000
## 10   Christy  Carlson     Graduate Degree     Management        70000
## 11      Gail Erickson           Education   Professional        90000
## 12     Barry  Johnson           Education     Management        80000
## 13     Peter    Krebs     Graduate Degree       Clerical        50000
## 14      Greg Alderson Partial High School       Clerical        45000
##      HireDate
## 1  2018-01-06
## 2  2009-12-10
## 3  2009-12-11
## 4  2008-12-12
## 5  2012-09-08
## 6  2006-07-09
## 7  2012-08-09
## 8  2005-07-07
## 9  2015-09-13
## 10 2015-01-14
## 11 2002-10-06
## 12 2015-05-14
## 13 2014-01-13
## 14 2005-07-13
str(employee)
## 'data.frame':    14 obs. of  6 variables:
##  $ FirstName   : Factor w/ 8 levels "Barry","Christy",..: 5 7 8 2 7 5 5 2 7 2 ...
##  $ LastName    : Factor w/ 13 levels "Alderson","Carlson",..: 12 5 10 13 4 9 8 7 11 2 ...
##  $ Education   : Factor w/ 7 levels "Bachelors","Education",..: 1 1 6 1 4 1 5 7 7 3 ...
##  $ Occupation  : Factor w/ 4 levels "Clerical","Management",..: 3 2 4 3 4 3 2 1 1 2 ...
##  $ YearlyIncome: int  90000 80000 50000 80000 60000 70000 80000 50000 45000 70000 ...
##  $ HireDate    : Factor w/ 14 levels "2002-10-06","2005-07-07",..: 14 6 7 5 9 4 8 2 13 11 ...

Now that we know how to read table and csv files from the package utils (one of the several built in packages that comes when you install R), we will learn how to read them using much faster functions that are able to deal with larger flat files and quickly.

We will need to load the package readr

library(readr)
## Warning: package 'readr' was built under R version 3.5.2

The package readr supports seven file formats with seven read_ functions:

employee_sales2 <- read_delim("EmployeeSales.txt", delim = ",")
## Parsed with column specification:
## cols(
##   Employee_ID = col_double(),
##   FirstName = col_character(),
##   LastName = col_character(),
##   Education = col_character(),
##   Occupation = col_character(),
##   YearlyIncome = col_double(),
##   Sales = col_double()
## )
employee_sales2
## # A tibble: 14 x 7
##    Employee_ID FirstName LastName Education  Occupation YearlyIncome  Sales
##          <dbl> <chr>     <chr>    <chr>      <chr>             <dbl>  <dbl>
##  1           1 John      Yang     Bachelors  Professio~        90000 3578. 
##  2           2 Rob       Johnson  Bachelors  Management        80000 3400. 
##  3           3 Ruben     Torres   Partial C~ Skilled M~        50000  699. 
##  4           4 Christy   Zhu      Bachelors  Professio~        80000 3078. 
##  5           5 Rob       Huang    High Scho~ Skilled M~        60000 2320. 
##  6           6 John      Ruiz     Bachelors  Professio~        70000  540. 
##  7           7 John      Miller   Masters D~ Management        80000 2320. 
##  8           8 Christy   Mehta    Partial H~ Clerical          50000   25.0
##  9           9 Rob       Verhoff  Partial H~ Clerical          45000   25.0
## 10          10 Christy   Carlson  Graduate ~ Management        70000 2235. 
## 11          11 Gail      Erickson Education  Professio~        90000 4320. 
## 12          12 Barry     Johnson  Education  Management        80000 4969. 
## 13          13 Peter     Krebs    Graduate ~ Clerical          50000   59.5
## 14          14 Greg      Alderson Partial H~ Clerical          45000   23.5

You also can read a csv file using read_csv()

employee2 <- read_csv("Employees.csv", TRUE)
## Parsed with column specification:
## cols(
##   FirstName = col_character(),
##   LastName = col_character(),
##   Education = col_character(),
##   Occupation = col_character(),
##   YearlyIncome = col_double(),
##   HireDate = col_date(format = "")
## )
employee2
## # A tibble: 14 x 6
##    FirstName LastName Education        Occupation   YearlyIncome HireDate  
##    <chr>     <chr>    <chr>            <chr>               <dbl> <date>    
##  1 John      Yang     Bachelors        Professional        90000 2018-01-06
##  2 Rob       Johnson  Bachelors        Management          80000 2009-12-10
##  3 Ruben     Torres   Partial College  Skilled Man~        50000 2009-12-11
##  4 Christy   Zhu      Bachelors        Professional        80000 2008-12-12
##  5 Rob       Huang    High School      Skilled Man~        60000 2012-09-08
##  6 John      Ruiz     Bachelors        Professional        70000 2006-07-09
##  7 John      Miller   Masters Degree   Management          80000 2012-08-09
##  8 Christy   Mehta    Partial High Sc~ Clerical            50000 2005-07-07
##  9 Rob       Verhoff  Partial High Sc~ Clerical            45000 2015-09-13
## 10 Christy   Carlson  Graduate Degree  Management          70000 2015-01-14
## 11 Gail      Erickson Education        Professional        90000 2002-10-06
## 12 Barry     Johnson  Education        Management          80000 2015-05-14
## 13 Peter     Krebs    Graduate Degree  Clerical            50000 2014-01-13
## 14 Greg      Alderson Partial High Sc~ Clerical            45000 2005-07-13

Other types of data

To get other types of data into R, we recommend starting with the tidyverse packages listed below. They’re certainly not perfect, but they are a good place to start. For rectangular data:

  • haven reads SPSS, Stata, and SAS files.

  • readxl reads excel files (both .xls and .xlsx).

  • DBI, along with a database specific backend (e.g. RMySQL, RSQLite, RPostgreSQL etc) allows you to run SQL queries against a database and return a data frame.

Subset R Objects

Vectors

To Subset vectors we use the [ operator.

x <- c("a", "b", "c", "c", "d", "a")  
x[1]    ## Extract the first element
## [1] "a"
x[2]    ## Extract the second element
## [1] "b"

The [ operator can be used to extract multiple elements of a vector by passing the operator an integer sequence.

x[1:4]
## [1] "a" "b" "c" "c"

The sequence does not have to be in order; you can specify any arbitrary integer vector.

x[c(1, 3, 4)]
## [1] "a" "c" "c"

We can also pass a logical sequence to the [ operator to extract elements of a vector that satisfy a given condition.

x[x > "a"]
## [1] "b" "c" "c" "d"

Matrix

x <- matrix(1:6, 2, 3)
x
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
x[1, 2]
## [1] 3
x[1, ] # extract row 1
## [1] 1 3 5

List

x <- list(foo = 1:4, bar = 0.6, baz = "hello")
x
## $foo
## [1] 1 2 3 4
## 
## $bar
## [1] 0.6
## 
## $baz
## [1] "hello"

The [[ operator can be used to extract single elements from a list. Here we extract the first element of the list.

x[[1]]
## [1] 1 2 3 4
x[["bar"]] # extract element by named indices
## [1] 0.6
x$bar #  extract elements by name
## [1] 0.6
x[c(1, 3)] # extract the first and third elements of a list
## $foo
## [1] 1 2 3 4
## 
## $baz
## [1] "hello"

Data frame

employee[,c(1,2,4)] # extract column 1, 2, and 4
##    FirstName LastName     Occupation
## 1       John     Yang   Professional
## 2        Rob  Johnson     Management
## 3      Ruben   Torres Skilled Manual
## 4    Christy      Zhu   Professional
## 5        Rob    Huang Skilled Manual
## 6       John     Ruiz   Professional
## 7       John   Miller     Management
## 8    Christy    Mehta       Clerical
## 9        Rob  Verhoff       Clerical
## 10   Christy  Carlson     Management
## 11      Gail Erickson   Professional
## 12     Barry  Johnson     Management
## 13     Peter    Krebs       Clerical
## 14      Greg Alderson       Clerical
employee[employee$Education == "Bachelors",] # subset only the professionals with Bachelor degree 
##   FirstName LastName Education   Occupation YearlyIncome   HireDate
## 1      John     Yang Bachelors Professional        90000 2018-01-06
## 2       Rob  Johnson Bachelors   Management        80000 2009-12-10
## 4   Christy      Zhu Bachelors Professional        80000 2008-12-12
## 6      John     Ruiz Bachelors Professional        70000 2006-07-09

Removing NA Values

x <- c(1, 2, NA, 4, NA, 5)
x
## [1]  1  2 NA  4 NA  5
bad <- is.na(x)
bad
## [1] FALSE FALSE  TRUE FALSE  TRUE FALSE
x[!bad]
## [1] 1 2 4 5

The dataset airquality is available in R

?airquality
head(airquality,15)
##    Ozone Solar.R Wind Temp Month Day
## 1     41     190  7.4   67     5   1
## 2     36     118  8.0   72     5   2
## 3     12     149 12.6   74     5   3
## 4     18     313 11.5   62     5   4
## 5     NA      NA 14.3   56     5   5
## 6     28      NA 14.9   66     5   6
## 7     23     299  8.6   65     5   7
## 8     19      99 13.8   59     5   8
## 9      8      19 20.1   61     5   9
## 10    NA     194  8.6   69     5  10
## 11     7      NA  6.9   74     5  11
## 12    16     256  9.7   69     5  12
## 13    11     290  9.2   66     5  13
## 14    14     274 10.9   68     5  14
## 15    18      65 13.2   58     5  15
good <- complete.cases(airquality)
head(airquality[good, ])
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 7    23     299  8.6   65     5   7
## 8    19      99 13.8   59     5   8

Exercise 1: Extract from employee dataset the employees with YearlyIncome > 70000

Work with Dates and Times

Times are represented by the POSIXct or the POSIXlt class

date_ex <- c("1990-02-05")
class(date_ex)
## [1] "character"
date_ex <- as.Date(date_ex)
class(date_ex)
## [1] "Date"

You can extract pieces of dates and/or times.

weekdays(date_ex)
## [1] "Monday"
months(date_ex)
## [1] "February"
quarters(date_ex)
## [1] "Q1"

You can get Current Date and Time using the following functions.

Sys.time()
Sys.Date()

Exercise 2: Extract weekday, month, quarter, and day of the week from today’s date.

We can use mathematical operations on dates and times. We can do + and - and comparisons too (i.e. ==, <=)

x <- as.Date("2018-03-04") 
y <- as.Date("2018-04-05") 

x+7 # Add 7 days to the date
## [1] "2018-03-11"
x-y
## Time difference of -32 days
x > y # logical statement
## [1] FALSE

Exercise 3: Extract the weekdays and months of the HireDate in the employee dataset.

Yes, we did it again!!!

Congratulations ladies and thanks for coming today!

Source used for this code can be found here